Data Engineer | Big Data Resume
Denver
SUMMARY
- Software professional with around 6 years of IT experience in the Big Data ecosystem, spanning ingestion, storage, querying, processing, and analysis of big data.
- Extensive experience working across various verticals such as Confidential, Confidential, and Confidential.
- Experience in creating applications using Spark with Python.
- Hands-on experience with Hadoop ecosystem components such as Spark, Airflow, MapReduce, HDFS, HBase, Zookeeper, Hive, Sqoop, and Oozie.
- Experience importing and exporting data into HDFS and Hive using Sqoop.
- Worked on developing, monitoring, and scheduling jobs using UNIX shell scripting.
- Experienced in installing, configuring, and administering Hadoop cluster.
- Worked in Data Engineering and Data Science teams building data ingestion pipelines, data lakes, and machine learning models that translate data points into business insights.
- Experience working with MapReduce programs, Pig scripts, and Hive commands to deliver the best results.
- Experience with Hive query optimization and performance tuning.
- Hands-on experience writing Pig Latin scripts and custom implementations using UDFs.
- Experience tuning Hadoop clusters to achieve good processing performance.
- Experience upgrading existing Hadoop clusters to the latest releases.
- Experience in data integration between Pentaho and Hadoop.
- Experience supporting data analysis projects using Elastic MapReduce (EMR) on the Amazon Web Services (AWS) cloud; performed export and import of data to and from S3.
- Well trained in problem-solving techniques, operating system concepts, programming basics, structured programming, and RDBMS.
- Knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions, and of data warehouse tools for reporting and data analysis.
- Strong analytical skills with the ability to quickly understand clients' business needs; involved in meetings to gather information and requirements from clients; led the team and coordinated onsite and offshore work.
- Experience includes requirements gathering/analysis, design, development, versioning, integration, documentation, testing, build, and deployment.
- Technical professional with management skills, excellent business understanding and strong communication skills.
TECHNICAL SKILLS
Languages: Python, SQL, UNIX Shell Script
Big Data and Hadoop Ecosystem: Spark, MapReduce, Sqoop, Hive, HDFS, Airflow
Databases: MySQL, PL/SQL
Cloud Computing: Amazon Web Services (EC2, EMR, S3, RDS)
Build Tools: Maven, Ant
Database Tools: SQL Developer, SQL Workbench
Development Tools: Eclipse, PuTTY, IntelliJ, PyCharm
Machine Learning: Linear Regression, Logistic Regression, Random Forest
Natural Language Processing: Regular Expressions, spaCy, Tokenization
PROFESSIONAL EXPERIENCE
Confidential, Denver
Data Engineer | Big Data
Responsibilities:
- Work closely with Engagement managers and Data Scientists to understand the requirements.
- Ingest and integrate different datasets from various sources.
- Developed data engineering and ETL Python scripts for ingestion pipelines running on AWS infrastructure built on EMR, S3, Glue, SQS, and Lambda.
- Teamed with cross-functional engineers to promote machine learning models from lower environments to production.
- Built Spark applications to offload heavy processing from SQL systems.
- Created pipelines in Python on the Spark framework to ingest data from different sources into the client's destination (see the PySpark sketch below).
- Validated data by profiling it against databases such as Hive.
- Created service-based Python applications to move data across AWS accounts by assuming an IAM role, since direct cross-account access is denied (see the boto3 sketch below).
- Moved data to and from HDFS and created tables on top of it.
- Used Hive through the Beeline client for faster and better performance.
- Used Sqoop to move large historical datasets into HDFS.
- Extracted the customer's big data from various sources, including mainframes and source databases, into AWS S3 for storage.
- Created Hive managed and external tables as required, defined with appropriate static and dynamic partitions for efficiency.
- Created shell scripts to execute batch jobs as cron jobs as part of a POC.
- Used secure copy (scp) to transfer files to and from remote servers.
- Generated and maintained RSA public/private key pairs and used SSO to sign in to edge nodes and remote servers.
- Developed various components in Python that can be used in ETL batch processing jobs.
- Used Jenkins to deploy Airflow jobs and applications to the server.
- Used Git for source control, which gives a significant speed advantage over centralized systems that must communicate with a server.
- Built deep learning RNNs and ANNs on free text in PyTorch to solve multi-class classification business problems on large datasets.
- Used the Airflow workflow engine to manage interdependent jobs and to automate several types of Hadoop jobs (Python MapReduce, Spark, Hive, and Sqoop) as well as system-specific jobs.
- Have knowledge of Snowflake, which is used by neighboring teams, though have not worked with it directly.
- Worked in an Agile methodology.
Tools/Technologies: Hadoop, Python, Amazon Web Services, Spark, Spark SQL, Hue, Hive, Jenkins, HDFS, Sqoop, Unix/Linux.
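A minimal sketch of the kind of PySpark ingestion pipeline described above; bucket names, paths, column names, and the table name are hypothetical placeholders, not the actual client resources:

```python
# Minimal PySpark ingestion sketch: read raw data from S3 and append it to a
# partitioned Hive table. All names and paths below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("s3-to-hive-ingestion")
    .enableHiveSupport()
    .getOrCreate()
)

# Allow dynamic partition inserts into the Hive table.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

# Read the raw source data landed in S3 (CSV here; could be Parquet, JSON, etc.).
raw = (
    spark.read
    .option("header", "true")
    .csv("s3://example-raw-bucket/sales/")
)

# Light validation/profiling step: drop rows missing a key and stamp a load date.
cleaned = (
    raw.dropna(subset=["order_id"])
    .withColumn("load_date", F.current_date())
)

# Write into a partitioned Hive table for downstream querying.
(
    cleaned.write
    .mode("append")
    .format("parquet")
    .partitionBy("load_date")
    .saveAsTable("analytics.sales_orders")
)
```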
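And a sketch of the cross-account pattern mentioned above: assuming an IAM role via STS and using the temporary credentials to copy S3 objects; the role ARN, bucket names, and key are hypothetical:

```python
# Sketch of moving data across AWS accounts by assuming an IAM role,
# since direct cross-account access is denied. The role ARN, buckets,
# and object key below are placeholders.
import boto3

def copy_across_accounts(role_arn: str, src_bucket: str, dst_bucket: str, key: str) -> None:
    # Ask STS for temporary credentials for the target-account role.
    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName="cross-account-data-move",
    )["Credentials"]

    # Build an S3 client that acts as the assumed role.
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

    # Server-side copy of the object into the destination bucket.
    s3.copy_object(
        Bucket=dst_bucket,
        Key=key,
        CopySource={"Bucket": src_bucket, "Key": key},
    )

if __name__ == "__main__":
    copy_across_accounts(
        role_arn="arn:aws:iam::123456789012:role/example-data-mover",
        src_bucket="example-source-bucket",
        dst_bucket="example-destination-bucket",
        key="exports/sales/2020-01-01.parquet",
    )
```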
Confidential, Indianapolis
Data Engineer | Big Data
Responsibilities:
- Extracted the customer's big data from various sources, including mainframes and source databases, into AWS S3, where the data lake is located.
- Created short-lived AWS EMR clusters as part of ETL jobs.
- Used Hive on top of Spark for faster and better performance.
- Developed various components in Java using Spring Batch for use in ETL batch processing jobs.
- Acquired data from transactional source systems into the Redshift data warehouse using Spark and AWS EMR.
- Administered SQL stored procedures, triggers, and complex queries; made use of aggregations and materialized views to optimize query performance.
- Worked on extracting event data from Kafka.
- Developed Python and shell scripts to schedule jobs.
- Used Sqoop to efficiently transfer data between databases and HDFS and used Flume to stream the log data from servers.
- Created Hive managed and external tables as required, defined with appropriate static and dynamic partitions for efficiency.
- Implemented partitioning and bucketing in Hive for better organization of the data (see the Hive DDL sketch below).
- Developed UDFs in Hive.
- Deployed and maintained production environments using AWS EC2 instances and Docker on ECS.
- Developed Python scripts to create batch jobs.
- Used the Airflow workflow engine to manage interdependent jobs and to automate several types of Hadoop jobs (Python, Hive, and Sqoop) as well as system-specific jobs (see the DAG sketch below).
- Used Jenkins to deploy Airflow jobs to the server.
Tools/Technologies: Amazon Web Services, Hadoop, Spark, Scala, Spark SQL, Data Lake, H2, Sqoop, Hive, Postgres, Python, Jenkins.
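A sketch of the partitioned, bucketed Hive table layout referred to above, issued here through Spark SQL (the same DDL could be run directly in Hive/Beeline); the database, table, columns, and S3 location are illustrative only:

```python
# Sketch of creating a partitioned, bucketed external Hive-format table via
# Spark SQL. Database, table, column names, and the location are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-table-layout")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events (
        event_id   STRING,
        user_id    STRING,
        event_type STRING,
        payload    STRING
    )
    PARTITIONED BY (event_date STRING)        -- partitioned by day
    CLUSTERED BY (user_id) INTO 32 BUCKETS    -- bucketed for joins and sampling
    STORED AS PARQUET
    LOCATION 's3://example-datalake-bucket/analytics/events/'
""")
```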
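And a minimal Airflow DAG sketch showing how interdependent Sqoop, Hive, and Python steps of the kind listed above can be chained; the script paths, schedule, and task names are placeholders (Airflow 2.x-style imports assumed):

```python
# Minimal Airflow DAG sketch: a Sqoop import, a Hive load, and a Python
# validation step chained as interdependent tasks. Commands, paths, and the
# schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def validate_row_counts():
    # Placeholder validation hook; real checks would query Hive or S3.
    print("row counts look sane")

with DAG(
    dag_id="daily_ingestion",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    sqoop_import = BashOperator(
        task_id="sqoop_import",
        bash_command="bash /opt/jobs/sqoop_import_orders.sh ",
    )

    hive_load = BashOperator(
        task_id="hive_load",
        bash_command="bash /opt/jobs/hive_load_orders.sh ",
    )

    validate = PythonOperator(
        task_id="validate",
        python_callable=validate_row_counts,
    )

    # Downstream tasks run only when their upstream dependency succeeds.
    sqoop_import >> hive_load >> validate
```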
Confidential, Denver
Big Data | Hadoop Consultant
Responsibilities:
- Analyzed Hadoop clusters and various big data analytics tools including Pig, Hive, and Sqoop; responsible for building scalable distributed data solutions using Cloudera Hadoop.
- Imported and exported data (SQL Server, Oracle, CSV, and text files) between local/external file systems and RDBMS and HDFS; loaded log data into HDFS using Flume.
- ETL data cleansing, integration, and transformation using Pig; responsible for managing data from disparate sources.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Data Warehousing: Designed a data warehouse using Hive, created and managed Hive tables in Hadoop.
- Created various UDFs in Pig and Hive to manipulate the data for various computations.
- Created MapReduce functions for certain computations.
- Workflow Management: Developed workflows in Oozie to automate loading data into HDFS and pre-processing it with Pig.
- Created and maintained Technical documentation for launching Hadoop Clusters and for executing Hive queries and Pig Scripts.
- Extensive hands-on experience with Hadoop file system commands for file handling operations.
- Worked with sequence files, RC files, map-side joins, bucketing, and partitioning for Hive performance and storage improvements.
- Worked on developing, monitoring, and scheduling jobs using UNIX shell scripting.
- Parsed XML files using MapReduce to extract sales-related attributes and store them in HDFS (see the streaming mapper sketch below).
- Built TBUILD scripts to import data from Teradata using Teradata Parallel Transporter (TPT) APIs.
- Used Spark to enhance the performance of the project.
- Also have good knowledge of Scala.
- Good knowledge of and exposure to Cassandra.
- Worked in an Agile methodology.
Tools/Technologies: Spark, MapReduce, Hive, Cloudera, Python, and Unix scripting.
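A sketch of the XML-parsing step above, written as a Hadoop Streaming mapper in Python; the record element and attribute names are assumptions about the sales feed, not the real schema:

```python
#!/usr/bin/env python
# Hadoop Streaming mapper sketch: parse one XML sales record per input line
# and emit tab-separated sales attributes for storage in HDFS. The <sale>
# fields below are assumptions about the feed, not the real schema.
import sys
import xml.etree.ElementTree as ET

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        record = ET.fromstring(line)
    except ET.ParseError:
        # Skip malformed records rather than failing the whole job.
        continue
    sale_id = record.findtext("sale_id", default="")
    store = record.findtext("store_id", default="")
    amount = record.findtext("amount", default="0")
    # Tab-separated key/value output, as expected by Hadoop Streaming.
    print(f"{sale_id}\t{store}\t{amount}")
```

In practice a script like this would be launched through the hadoop-streaming jar as the -mapper, paired with a reducer (or an identity reducer) that writes the extracted attributes to HDFS.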
Confidential
Associate Developer
Responsibilities:
- Designed and added new functionality to existing applications using J2EE, XML, Ajax, Servlets, and JSP.
- Created different batch programs to clean up tables in DB2 database.
- Extensively used collections and exception handling in batch programs for database cleanup.
- Worked on UNIX shell scripting to run the JAR file created for the batch program.
- Used the Struts framework for UI design and validations.
- Developed Action classes, which act as the controller in the Struts framework.
- Created AJAX forms for update operations.
- Converted data into JSON using JSP tags.
- Enhanced the existing application to meet the business requirement.
- Established JDBC connections using a database connection pool.
- Wrote complex SQL statements to retrieve data from the DB2 database.
- Participated in the Production support and maintenance of the project.
- Created new tables in DB2 database.
- Developed the application using Eclipse on Windows XP and deployed it on Apache Tomcat 6.0 on Windows Server 2003.
- Used the ClearCase version control system.
- Performed unit testing for the application using JUnit.
Tools/Technologies: Java, JavaScript, Ajax, JSON, Struts, Design Patterns, Eclipse, Apache Tomcat, DB2, UNIX, ClearCase, JUnit