
Data Engineer Resume


Dallas, TX

SUMMARY

  • 7+ years of overall IT experience in a variety of industries, which includes hands-on experience in Big Data and Data Warehouse ETL technologies.
  • Experience in all phases of the software development life cycle under Agile, Scrum and Waterfall management processes.
  • Expertise in Apache Hadoop ecosystem components like Spark, Hadoop Distributed File System (HDFS), Kafka, Hive, MapReduce, Sqoop, HBase, Zookeeper, Airflow, Snowflake, YARN, Flume, Pig, NiFi, Scala and Oozie.
  • In-depth understanding of Spark architecture including Spark Core, Spark SQL, Data Frames and Spark Streaming.
  • Expertise in writing Spark RDD transformations, actions, DataFrames, persistence (caching), accumulators, broadcast variables, broadcast optimizations and case classes for the required input data, and in performing data transformations using Spark Core (a minimal PySpark sketch follows this list).
  • Experience in creating DataFrames using PySpark and performing operations on the DataFrames using Python.
  • Experienced in ingesting data into HDFS from various relational databases like MySQL, Oracle, DB2, Teradata and Postgres using Sqoop.
  • Experience in working with NoSQL Databases like HBase, Cassandra and MongoDB.
  • Well versed in various Hadoop distributions, which include Cloudera (CDH), Hortonworks (HDP) and the MapR distribution.
  • Expert in developing SSIS/DTS packages to extract, transform and load (ETL) data into data warehouses/data marts from heterogeneous sources.
  • Hands-on experience with AWS technologies like EC2, S3, DynamoDB, Redshift, Auto Scaling, CloudWatch, SNS, SES, SQS and Lambda.
  • Hands-on experience building the infrastructure required for optimal extraction, transformation and loading (ETL) of data from a wide variety of data sources using NoSQL and SQL from AWS and Big Data technologies (DynamoDB, Kinesis, S3, Hive/Spark).
  • Worked with various data science and machine learning libraries such as Pandas, NumPy, SciPy, Matplotlib, Seaborn, Bokeh, NLTK, scikit-learn, OpenCV, TensorFlow, Theano and Keras.
  • Experience in data modeling, with expertise in creating star and snowflake schemas, fact and dimension tables, and physical and logical data models using Erwin and Embarcadero.
  • Strong experience in Business and Data Analysis, Data Profiling, Data Migration, Data Integration, Data governance and Metadata Management, Master Data Management and Configuration Management.
  • Good understanding of the OpenShift platform for managing Docker containers using Docker Swarm and Kubernetes clusters.
  • Experience in working with various version control tools like SVN, GIT and CVS.
  • Experience in various automation tools like Terraform, Ansible.
  • Experience in working with CI/CD pipeline using tools like Jenkins and Chef.
  • Experience in Data Warehousing applications, responsible for the Extraction, Transformation and Loading (ETL) of data from multiple sources into Data Warehouse.
  • Experience building and optimizing ‘big data’ data pipelines, architectures and data sets.
  • Experienced in writing and implementing unit test cases using testing frameworks like JUnit, EasyMock and Mockito.
  • Experience in project management and bug tracking tools such as JIRA and Bugzilla.
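
The Spark RDD bullet above references the sketch below: a minimal, illustrative PySpark example of RDD transformations, caching, a broadcast variable and an accumulator. The sample data, lookup values and names are assumptions made for illustration, not code from any of the projects described here.

```python
from pyspark.sql import SparkSession

# Minimal RDD sketch: transformations, caching, a broadcast variable
# and an accumulator. All data and names below are illustrative.
spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

valid_codes = sc.broadcast({"US", "CA", "MX"})   # broadcast lookup set
bad_records = sc.accumulator(0)                  # counts rejected rows

raw = sc.parallelize(["US,120.50", "CA,75.00", "XX,10.00", "MX,33.25"])

def parse(line):
    country, amount = line.split(",")
    return country, float(amount)

parsed = raw.map(parse).cache()                  # cache for reuse

def keep(record):
    country, _ = record
    if country not in valid_codes.value:
        bad_records.add(1)                       # count filtered-out rows
        return False
    return True

totals = (parsed.filter(keep)
                .reduceByKey(lambda a, b: a + b)  # aggregate per country
                .collect())                       # action triggers execution

print(totals)             # e.g. [('US', 120.5), ('CA', 75.0), ('MX', 33.25)]
print(bad_records.value)  # 1 record rejected by the broadcast lookup
spark.stop()
```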

TECHNICAL SKILLS

Programming Languages: Python, R, SQL, Scala, Pig

Big Data Ecosystem: Hadoop, MapReduce, HDFS, HBase, Zookeeper, NiFi, Hive, Airflow, Pig, Sqoop, Oozie, Storm, and Flume.

Spark Technologies: Spark SQL, Spark RDD, Spark Streaming, Spark Core, Data Frames.

Databases: Oracle, MySQL, SQL Server, PostgreSQL; NoSQL: MongoDB, Cassandra, HBase, DynamoDB.

Hadoop Platforms: Cloudera, Hortonworks, MapR

Cloud Technologies: AWS, EC2, S3, RedShift, RDS, VPC, Docker, Kinesis, Lambda, API Gateway.

Methodologies: Agile, Scrum, Waterfall

Version Control Tools: GIT, CVS, SVN

Build Tools: ANT, Maven, Jenkins

Operating Systems: Unix/Linux, Windows, Mac OS.

PROFESSIONAL EXPERIENCE

Confidential, Dallas, TX

Data Engineer

Responsibilities:

  • Used Spark SQL to load JSON data, create schema RDDs and load them into Hive tables, and handled structured data using Spark SQL (a minimal PySpark sketch of this pattern follows this list).
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
  • Used Spark SQL to migrate data from Hive into Python using the PySpark library.
  • Developed Spark programs using Scala APIs to compare the performance of Spark with Hive and SQL.
  • Used Redshift and S3 within AWS cloud services to load data into S3 buckets.
  • Worked with HDFS file formats like Avro, Sequence File and various compression formats like Snappy.
  • Performed data wrangling to clean, transform and reshape the data using the pandas library.
  • Analyzed data using SQL, Scala, Python, Apache Spark and presented analytical reports to management and technical teams.
  • Used Zookeeper to store offsets of messages consumed for a specific topic and partition by a specific consumer group in Kafka.
  • Implemented a CI/CD pipeline for code deployment using Jenkins.
  • Developed Sqoop and Kafka jobs to load data from RDBMS and external systems into HDFS and Hive.
  • Involved in writing live real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
  • Involved in Creating, Debugging, Scheduling and Monitoring jobs using Airflow for ETL batch processing to load into Snowflake for analytical processes.
  • Developed a Spark Streaming module for consumption of Avro messages from Kafka.
  • Implemented dimensional data modeling to deliver multi-dimensional star schemas and developed snowflake schemas by normalizing the dimension tables as appropriate.
  • Worked on ingestion of applications/files from one Commercial VPC to OneLake.
  • Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
  • Developed Scala scripts using DataFrames/SQL/Datasets and RDDs in Spark for data aggregation, queries and writing data back into the OLTP system through Sqoop.
  • Worked with the ETL team to document the transformation rules for Data migration from OLTP to Warehouse environment for reporting purposes.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Involved in Creating, Debugging, Scheduling and Monitoring jobs using Airflow and Oozie.
  • Involved in moving the raw data between different systems using Apache NiFi.
  • Developed and implemented several types of sub-reports, drill-down reports, summary reports, parameterized reports and ad-hoc reports using Tableau.
  • Consumed XML messages using Kafka and processed the XML files using Spark Streaming to capture UI updates.
  • Used Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model that gets data from Kafka in near real time and persists it to Cassandra.
  • Implemented a serverless architecture using API Gateway, Lambda and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets.
  • Responsible for implementing monitoring solutions in Ansible, Terraform, Docker, and Jenkins.
  • Used Jira for bug tracking and Bitbucket to check in and check out code changes.
  • Worked in a production environment building a CI/CD pipeline using Jenkins, with stages ranging from code checkout from GitHub to deploying code into the target environment.
  • Organized and implemented different load balancing solutions for the PostgreSQL cluster.
  • Followed agile methodology and actively participated in daily scrum meetings.
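
As referenced in the first bullet of this list, the sketch below shows the JSON-to-Hive pattern in minimal PySpark form. The input path, view name, table name and column names are placeholders assumed for illustration.

```python
from pyspark.sql import SparkSession

# Illustrative only: the path, view, table and columns are placeholders.
spark = (SparkSession.builder
         .appName("json-to-hive-sketch")
         .enableHiveSupport()        # required to save managed Hive tables
         .getOrCreate())

# Load JSON into a DataFrame and expose it to Spark SQL.
events = spark.read.json("hdfs:///data/raw/events/")   # hypothetical path
events.createOrReplaceTempView("events_stg")

# Handle the structured data with Spark SQL, then persist to a Hive table.
daily = spark.sql("""
    SELECT event_date, event_type, COUNT(*) AS event_count
    FROM events_stg
    GROUP BY event_date, event_type
""")
daily.write.mode("overwrite").saveAsTable("analytics.daily_events")

spark.stop()
```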

Environment: Hadoop, Spark, Scala, Hive, Python, AWS, EC2, S3, VPC, RDS, Redshift, DynamoDB, Tableau, HDFS, PySpark, PostgreSQL, Jenkins, SQL, Cassandra, HBase, Snowflake, JIRA, GitHub, Agile.

Confidential, Dallas, TX

Data Engineer

Responsibilities:

  • Implemented Spark using Python/Scala, utilizing Spark Core for faster data processing instead of MapReduce in Java.
  • Created data pipelines for different ingestion and aggregation events, loading consumer response data from an AWS S3 bucket into Hive external tables in HDFS to serve as a feed for Tableau dashboards.
  • Developed Python scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation, queries and writing data back into the RDBMS through Sqoop.
  • Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files.
  • Analyzed large and critical datasets using big data tools such as HDFS, MapReduce, Hive, Hive UDF, Pig, Sqoop and Spark.
  • Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra per the business requirements.
  • Scheduled the Oozie workflow engine to run multiple Hive and Pig jobs, which run independently based on time and data availability.
  • Used Talend for Big data Integration using Spark and Hadoop.
  • Implemented and managed ETL solutions and automated operational processes.
  • Used the Python boto3 library to configure AWS services such as EC2 and S3 (a short boto3 sketch appears after this list).
  • Worked on cloud deployments using Maven and Jenkins.
  • Responsible for creating on-demand tables on S3 files using Python.
  • Wrote scripts and indexing strategy for a migration to Confidential Redshift from SQL Server and MySQL databases.
  • Worked extensively with Dimensional modeling, Data migration, Data cleansing, ETL Processes for data warehouses.
  • Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data sets processing and storage.
  • Worked on custom Pig Loaders and Storage classes to work with a variety of data formats such as JSON and compressed CSV.
  • Involved in running Hadoop Streaming jobs to process terabytes of data.
  • Used JIRA for bug tracking and GIT for version control.
  • Followed agile methodology for the entire project.
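
The boto3 bullet above points to the sketch below: a minimal example of configuring and inspecting S3 and EC2 from Python. The bucket, key names, local file and region are assumptions, and credentials are expected to come from the environment or an instance profile.

```python
import boto3

# Illustrative boto3 usage; bucket, keys, file and region are placeholders.
s3 = boto3.client("s3")
ec2 = boto3.client("ec2", region_name="us-east-1")

# Upload a local extract to S3 for downstream processing.
s3.upload_file("consumer_response.csv", "my-data-bucket",
               "raw/consumer_response/consumer_response.csv")

# List objects under the prefix to confirm the load landed.
resp = s3.list_objects_v2(Bucket="my-data-bucket",
                          Prefix="raw/consumer_response/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Inspect running EC2 instances, e.g. the processing cluster nodes.
reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)
for r in reservations["Reservations"]:
    for inst in r["Instances"]:
        print(inst["InstanceId"], inst["InstanceType"])
```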

Environment: Hadoop, Spark, Scala, Hive, MapReduce, Python, AWS, EC2, S3, VPC, Tableau, HDFS, Jenkins, SQL, MySQL, Cassandra, HBase, JIRA, GitHub, Agile.

Confidential, Dallas, TX

Data Engineer

Responsibilities:

  • Involved in designing and deployment of the Hadoop cluster and different big data analytic tools including Pig, Hive, HBase and Sqoop.
  • Developed simple and complex MapReduce programs in Hive, Pig and Python for data analysis on different data formats (a minimal Hadoop Streaming-style Python sketch follows this list).
  • Performed data transformations by writing MapReduce and Pig scripts as per business requirements.
  • Developed Talend Big Data jobs to load heavy volumes of data into the S3 data lake.
  • Installed the Oozie workflow engine to run multiple Hive jobs.
  • Implemented MapReduce programs to handle semi-structured and unstructured data such as XML, JSON and Avro data files, and sequence files for log files.
  • Developed various Python scripts to find vulnerabilities with SQL Queries by doing SQL injection, permission checks and analysis.
  • Managed and reviewed Hadoop and HBase log files.
  • Imported semi-structured data from Avro files using Pig to make serialization faster.
  • Involved in creating Hive tables, loading data, writing Hive queries, and creating partitions and buckets for optimization.
  • Managed and scheduled jobs on a Hadoop cluster using UC4 (Confidential proprietary scheduling tool) workflows.
  • Involved in working with messaging systems using message brokers such as RabbitMQ.
  • Continuously monitored and managed the Hadoop cluster through the Hortonworks (HDP) distribution.
  • Created a data pipeline package to move data from an Amazon S3 bucket to a MySQL database and executed MySQL stored procedures using events to load data into tables.
  • Designed and built multi-terabyte, full end-to-end data warehouse infrastructure from the ground up on Confidential Redshift, handling millions of records every day at large scale.
  • Worked on big data on AWS cloud services, i.e., EC2 and S3.
  • Involved in loading data from UNIX file system and FTP to HDFS.
  • Developed UDFs in Java to enhance the functionality of Pig and Hive scripts.
  • Worked on complex SQL queries and PL/SQL procedures and converted them to ETL tasks.
  • Wrote various data normalization jobs for new data ingested into Redshift.
  • Migrated the on-premises database structure to the Confidential Redshift data warehouse.
  • Was responsible for ETL and data validation using SQL Server Integration Services.
  • Defined and deployed monitoring, metrics, and logging systems on AWS.
  • Worked with Linux systems and RDBMS database on a regular basis in order to ingest data using Sqoop.
  • Experience in data cleansing and data mining.
  • Involved in writing UNIX shell scripts and automating the ETL processes using UNIX shell scripting.
  • Worked with utilities like TDCH to load data from Teradata into Hadoop.
  • Involved in Scheduling jobs using Crontab.
  • Followed Agile methodology, including test-driven development and pair-programming concepts.
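
As noted in the MapReduce bullet above, the sketch below is a minimal Hadoop Streaming-style mapper/reducer pair in Python. The comma-separated input layout (a country code in the first field) is an assumption, and the two stages are combined into one file only so the sketch can be smoke-tested locally; with Hadoop Streaming they would run as separate mapper and reducer scripts, with the framework handling the sort between them.

```python
#!/usr/bin/env python
"""Minimal Hadoop Streaming-style sketch (input format is illustrative)."""
import sys
from itertools import groupby

def mapper(lines):
    # Emit (country, 1) for every record; the first CSV field is assumed
    # to hold a country code.
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if fields and fields[0]:
            yield fields[0], 1

def reducer(pairs):
    # Reducer input arrives grouped by key, so group and sum the counts.
    for country, group in groupby(pairs, key=lambda kv: kv[0]):
        yield country, sum(count for _, count in group)

if __name__ == "__main__":
    # Local smoke test: sorted() stands in for Hadoop's shuffle/sort.
    mapped = sorted(mapper(sys.stdin))
    for country, total in reducer(mapped):
        print(f"{country}\t{total}")
```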

Environment: Hadoop, AWS, Python, Hive, Pig, SQL, Unix/Linux, Agile, Sqoop, MapReduce, HBase, HDFS.

Confidential

Data Engineer

Responsibilities:

  • Wrote Spark applications using Scala to interact with the database using Spark SQLContext and accessed Hive tables using HiveContext.
  • Involved in designing different components of the system, such as the big-data event processing framework (Spark), the distributed messaging system (Kafka) and the SQL database.
  • Implemented Spark Streaming and Spark SQL using Data Frames.
  • Created multiple Hive tables, implemented Dynamic Partitioning and Buckets in Hive for efficient data access.
  • Involved in converting MapReduce programs into Spark transformations using Spark RDD on Python.
  • Implemented Sqoop jobs for large data exchanges between RDBMS (MySQL, Oracle) and Hive clusters.
  • Worked extensively with Dimensional modeling, Data migration, Data cleansing, ETL Processes for data warehouses.
  • Designed tables and columns in Redshift for data distribution across data nodes in the cluster, keeping columnar database design considerations in mind.
  • Involved in creating Hive external tables and used custom SerDes based on the structure of the input files so that Hive knows how to load the files into Hive tables.
  • Used PySpark DataFrames to read text, CSV and image data from HDFS, S3 and Hive.
  • Managed large datasets using pandas DataFrames and MySQL.
  • Maintained Tableau functional reports based on user requirements.
  • Performed CI/CD operations with Gitlab pipelines, Jenkins and Docker.
  • Well versed in using data manipulations and compactions in Cassandra.
  • Used Python for pattern matching in build logs to format warnings and errors (a short regex sketch follows this list).
  • Monitor System health and logs and respond accordingly to any warning or failure conditions.
  • Worked on scheduling all jobs using Oozie.
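
The Python pattern-matching bullet above refers to the sketch below: a short, illustrative regex scan that reformats warnings and errors from build logs. The log line format and sample lines are assumptions.

```python
import re

# Illustrative build-log scan; the log format below is an assumed example.
LOG_PATTERN = re.compile(
    r"^(?P<level>WARNING|ERROR):\s*(?P<file>\S+):(?P<line>\d+):\s*(?P<message>.*)$"
)

sample_log = [
    "INFO: starting build",
    "WARNING: etl/load.py:42: deprecated API call",
    "ERROR: etl/transform.py:108: unresolved import",
]

for raw in sample_log:
    match = LOG_PATTERN.match(raw)
    if match:
        # Reformat matched warnings/errors into a consistent summary line.
        print("{level:8} {file}:{line}  {message}".format(**match.groupdict()))
```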

Environment: Oozie 4.2, Kafka, Spark, Spark SQL, Tableau, Shell Script, Sqoop, Scala
