Data Engineer Resume
New York, NY
SUMMARY
- 7+ years of IT experience across the Health, Banking, Insurance, and E-commerce domains in the design, development, maintenance, and support of Big Data and Java/J2EE applications, including 6+ years of analysis, design, development, and implementation as a Data Engineer.
- Strong exposure to the Spark, Spark Streaming, and Spark MLlib frameworks, developing production-ready Spark applications using both the Scala and Python APIs.
- Hands-on experience with Spark: importing data from sources such as storage layers, Kafka, and databases, performing transformations, and saving the results to different destinations (a minimal PySpark sketch appears at the end of this summary).
- Worked extensively on fine-tuning Spark applications to improve performance and on troubleshooting Spark application failures.
- Experience with Spark Streaming and Kafka for building reliable streaming pipelines, including troubleshooting and fine-tuning streaming applications to handle and recover from failures.
- Extensively worked on Spark with Scala on-cluster for analytics, installed on top of Hadoop, and built advanced analytical applications using Spark with Hive and SQL/Oracle.
- Good understanding of Hadoop architecture and the various components of the Big Data ecosystem.
- Experienced with Hadoop distributions both on-premises (CDH, HDP) and in the cloud (AWS).
- Good experience with data analytics and big data services in the AWS Cloud such as EMR, Redshift, S3, Athena, and Glue.
- Used Hive extensively to perform the data analytics required by business teams.
- Solid experience working with data formats such as Parquet, ORC, Avro, and JSON.
- Expertise in writing dynamic SQL, complex stored procedures, functions, and views.
- Excellent understanding and knowledge of NoSQL databases such as MongoDB, HBase, and Cassandra.
- Experience importing and exporting data with Sqoop between HDFS and relational database systems.
- Experience in Object-Oriented Analysis and Design (OOAD) and software development using UML methodology, with good knowledge of J2EE and Core Java design patterns.
- Experience managing Hadoop clusters using the Cloudera Manager tool.
- Very good experience with the complete project life cycle (design, development, testing, and implementation) of client-server and web applications.
- Stored data in the Apache Ignite storage layer and ran standalone web services that exposed the output on AWS.
- Good working knowledge of Snowflake and Teradata databases.
- Hands-on experience with Sequence files, RC files, Avro, Parquet, and JSON, and with Combiners, Counters, dynamic partitions, and bucketing for best practices and performance improvement.
- Skilled in developing Java MapReduce programs using the Java API and in using Hive and Pig for data analysis, cleaning, and transformation.
- Experience analyzing SQL scripts and designing solutions to implement them using PySpark.
- Worked with join patterns and implemented map-side and reduce-side joins using MapReduce.
- Developed enterprise applications using Scala.
- Developed multiple MapReduce jobs to perform data cleaning and preprocessing.
- Designed Hive queries and Pig scripts to perform data analysis and data transfer, and designed tables to load data into the Hadoop environment.
- Debugged and improved the performance of Hive SQL queries by adding partition columns.
- Converted Hive SQL to Spark SQL as part of the migration of pipelines.
- Expertise in writing Hive UDFs and generic UDFs to incorporate complex business logic into Hive queries.
- Extensive experience importing and exporting data using stream processing platforms such as Flume and Kafka.
- Expertise with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
- Exposure to build tools such as Maven and SBT.
- Experience as a Java developer in client/server technologies using J2EE Servlets, JSP, JDBC, and SQL.
- Expertise in designing and developing enterprise applications for the J2EE platform using MVC, JSP, Servlets, JDBC, Web Services, and Hibernate, and in designing web applications using HTML5, CSS3, AngularJS, and Bootstrap.
- Excellent interpersonal and communication skills, creative, research-minded, technically competent, and result-oriented with problem solving and leadership skills.
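A minimal PySpark sketch of the kind of batch pipeline described in this summary: read raw data from a source, apply cleansing transformations, and write the results as partitioned Parquet. The paths, column names, and application name below are illustrative assumptions, not details from a specific project.

# Minimal PySpark batch pipeline sketch (paths and columns are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("example-batch-pipeline")
    .getOrCreate()
)

# Read raw JSON events from a storage layer (assumed S3 path).
raw = spark.read.json("s3a://example-bucket/raw/events/")

# Basic cleansing: drop malformed rows, de-duplicate, derive a date column.
cleaned = (
    raw.dropna(subset=["event_id", "event_ts"])
       .dropDuplicates(["event_id"])
       .withColumn("event_date", F.to_date(F.col("event_ts")))
)

# Save the results as date-partitioned Parquet for downstream querying.
(
    cleaned.write
           .mode("overwrite")
           .partitionBy("event_date")
           .parquet("s3a://example-bucket/curated/events/")
)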
TECHNICAL SKILLS
Big Data Ecosystem: Spark, Hive, MapReduce, YARN, HDFS, Impala, Sqoop, Kafka and Oozie
Programming Languages: Java, Scala, and Python
Frameworks: Spring, Hibernate, JMS.
IDE: Eclipse, IntelliJ, PyCharm.
Databases: IBM DB2, Oracle, SQL Server, MySQL, HBase, Cassandra.
Tools: Tableau, Zoomdata, Talend.
Cloud Services: AWS S3, EMR, Athena, Redshift, Glue Metastore, Lambda, Azure Databricks.
Methodologies: Agile, Waterfall.
PROFESSIONAL EXPERIENCE:
Data Engineer
Confidential, New York, NY
Responsibilities:
- Responsible for ingesting large volumes of user behavioral data and customer profile data to Analytics Data store.
- Developed custom multi-threaded Java based ingestion jobs as well as Sqoop jobs for ingesting from FTP servers and data warehouses.
- Used Python with PySpark to build data pipelines and wrote Python scripts to automate them.
- Developed many Spark applications for data cleansing, event enrichment, data aggregation, de-normalization, and data preparation needed for machine learning exercises.
- Developed Spark applications using PySpark to enrich user behavioral (clickstream) data merged with user profile data (see the illustrative sketch at the end of this role).
- Created S3 buckets, managed bucket policies, and used S3 and Glacier for storage and backup on AWS.
- Designed and implemented a test environment on AWS.
- Involved in designing and developing enhancements of CSG using AWS APIs.
- Acted as technical liaison between the customer and the team on all AWS technical aspects.
- Created pipelines to move data from on-premises servers to Azure Data Lake.
- Utilized Azure HDInsight to monitor and manage one of our Hadoop clusters.
- Experience with Azure Databricks in processing raw data from source systems and writing to destination delta lakes.
- Worked on troubleshooting Spark applications to make them more fault tolerant.
- Utilized the PySpark API to implement batch processing jobs.
- Worked on fine-tuning Spark applications to improve the overall processing time of the pipelines.
- Wrote Kafka producers to stream data from external REST APIs to Kafka topics.
- Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to HBase.
- Experienced in handling large datasets using Spark's in-memory capabilities, broadcast variables, effective and efficient joins, transformations, and other features.
- Worked extensively with Sqoop for importing data from Oracle.
- Experience working with EMR clusters in the AWS cloud and with S3.
- Involved in creating Hive tables and in loading and analyzing data using Hive scripts.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Good experience with continuous integration of applications using Jenkins.
- Used reporting tools such as Tableau connected to Athena for generating daily data reports.
- Collaborated with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.
Environment: AWS Cloud, Spark, Spark Streaming, Spark SQL, Python, PySpark, Scala, Kafka, Hive, Sqoop, HBase, Azure HDInsight, Tableau, AWS Simple workflow, Oracle, Linux.
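A rough illustration of the clickstream enrichment work in this role: join a large behavioral dataset against a smaller user-profile table with a broadcast join, then aggregate for analytics. Paths, column names, and the join key are hypothetical.

# Illustrative PySpark enrichment job: clickstream events joined to user profiles.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream-enrichment").getOrCreate()

# Assumed inputs: large clickstream data and a comparatively small profile table.
clicks = spark.read.parquet("s3a://example-bucket/clickstream/")
profiles = spark.read.parquet("s3a://example-bucket/user_profiles/")

# Broadcast the smaller side to avoid shuffling the large dataset.
enriched = (
    clicks.join(F.broadcast(profiles), on="user_id", how="left")
          .withColumn("session_date", F.to_date(F.col("click_ts")))
)

# Simple daily aggregation for downstream analytics (illustrative metrics).
daily_activity = (
    enriched.groupBy("session_date", "customer_segment")
            .agg(F.countDistinct("user_id").alias("active_users"),
                 F.count("*").alias("events"))
)

daily_activity.write.mode("overwrite").parquet("s3a://example-bucket/analytics/daily_activity/")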
Sr. Hadoop/Spark Developer
Confidential, Phoenix, AZ
Responsibilities:
- Involved in requirement analysis, design, coding and implementation phases of the project.
- Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.
- Converted existing MapReduce jobs into Spark transformations and actions using Spark RDDs, Data frames and Spark SQL APIs.
- Wrote new Spark jobs in Scala to analyze customer data and sales history.
- Used Kafka to get data from many streaming sources into HDFS.
- Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
- Created a data lake with Snowflake and built several data marts with presentable, modeled data.
- Good experience with Hive partitioning, bucketing, and collections, and with performing different types of joins on Hive tables.
- Created Hive external tables to perform ETL on data that is generated on a daily basis.
- Wrote HBase bulk-load jobs to load processed data into HBase tables by converting it to HFiles.
- Performed validation on the data ingested to filter and cleanse the data in Hive.
- Created Sqoop jobs to handle incremental loads from RDBMS into HDFS and applied Spark transformations.
- Loaded data into Hive tables from Spark using the Parquet columnar format (see the sketch at the end of this role).
- Developed Oozie workflows to automate and productionize the data pipelines.
- Developed Sqoop import Scripts for importing reference data from Netezza.
Environment: Hadoop, HDFS, Hive, Sqoop, Kafka, Spark, Shell Scripting, Snowflake, HBase, Scala, Python, Kerberos, Maven, Ambari, Hortonworks, MySQL.
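A minimal sketch of loading Spark output into a partitioned, Parquet-backed Hive table, in the spirit of the Hive and Parquet work above. The database, table, column names, and paths are assumptions, and a configured Hive metastore is assumed to be available.

# Sketch: write a Spark DataFrame into a partitioned Hive table stored as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive-parquet-load")
    .enableHiveSupport()          # requires a reachable Hive metastore
    .getOrCreate()
)

# Allow Hive dynamic partitioning so partitions are derived from the data.
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

sales = spark.read.parquet("hdfs:///data/staging/sales/")   # hypothetical staging path

# Create the target table once if it does not exist (illustrative DDL).
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.sales_daily (
        order_id STRING,
        customer_id STRING,
        amount DOUBLE
    )
    PARTITIONED BY (sale_date STRING)
    STORED AS PARQUET
""")

# insertInto matches columns by position, so select them with the partition column last.
(
    sales.withColumn("sale_date", F.date_format(F.col("order_ts"), "yyyy-MM-dd"))
         .select("order_id", "customer_id", "amount", "sale_date")
         .write.mode("append")
         .insertInto("analytics.sales_daily")
)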
Hadoop Developer
Confidential, Sterling, VA
Responsibilities:
- Developed custom input adaptors for ingesting clickstream data from external sources such as FTP servers into S3-backed data lakes on a daily basis.
- Created various Spark applications using PySpark and Scala to perform a series of enrichments of this clickstream data combined with enterprise user data.
- Implemented batch processing of jobs using Spark Scala API.
- Developed Sqoop scripts to import/export data from Teradata to HDFS and into Hive tables.
- Optimized Hive tables using techniques such as partitioning and bucketing to provide better performance for HiveQL queries.
- Worked with multiple file formats such as Avro, Parquet, and ORC.
- Converted existing MapReduce programs to Spark Applications for handling semi structured data like JSON files, Apache Log files, and other custom log data.
- Wrote Kafka producers to stream data from external REST APIs to Kafka topics (see the producer sketch at the end of this role).
- Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to HBase.
- Experienced in handling large datasets using Spark's in-memory capabilities, broadcast variables, effective and efficient joins, transformations, and other features.
- Worked extensively with Sqoop for importing data from Teradata.
- Implemented business logic in Hive and wrote UDFs to process the data for analysis.
- Utilized AWS services such as S3, EMR, Redshift, Athena, and Glue Metastore for building and managing data pipelines in the cloud.
- Automated EMR Cluster creation and termination using AWS Java SDK.
- Loaded the processed data into Redshift clusters using the Spark-Redshift integration.
- Created views within Athena to allow the downstream reporting and data analysis teams to query and analyze the results.
Environment: Spark, Hive, HBase, Scala, Python, Shell Scripting, Amazon EMR, S3
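The producer pattern referenced in this role (streaming records from an external REST API into a Kafka topic) could look roughly like the sketch below, here using the kafka-python client. The endpoint URL, topic name, broker address, and polling interval are illustrative assumptions.

# Illustrative Kafka producer: poll a REST API and publish records to a topic.
import json
import time

import requests
from kafka import KafkaProducer   # kafka-python client (assumed dependency)

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],                     # placeholder brokers
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

API_URL = "https://api.example.com/events"   # hypothetical endpoint
TOPIC = "clickstream-events"                 # hypothetical topic

while True:
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()

    for record in response.json():   # assumes the API returns a JSON array
        producer.send(TOPIC, value=record)

    producer.flush()   # make sure the batch is delivered before sleeping
    time.sleep(60)     # poll once a minute (arbitrary choice)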
Hadoop Developer
Confidential, Pittsburg, PA
Responsibilities:
- Developed Spark applications using Scala utilizing Data frames and Spark SQL API for faster processing of data.
- Developed highly optimized Spark applications to perform data cleansing, validation, transformation, and summarization activities according to the requirements.
- Built a data pipeline consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and analyze operational data.
- Developed Spark jobs and Hive Jobs to summarize and transform data.
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
- Analyzed the SQL scripts and designed the solution to implement using Scala.
- Built real-time data pipelines by developing Kafka producers and Spark Streaming applications to consume the data (see the consumer sketch at the end of this role).
- Ingested syslog messages, parsed them, and streamed the data to Kafka.
- Handled importing data from different sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the transformed data into HDFS.
- Exported the analyzed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.
- Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
- Analyzed the data by performing Hive queries (Hive QL) to study customer behavior.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Developed Hive scripts in Hive QL to de-normalize and aggregate the data.
- Scheduled and executed workflows in Oozie to run various jobs.
Environment: Hadoop, HDFS, HBase, Spark, Scala, Hive, MapReduce, Sqoop, ETL, Java
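A sketch of the streaming consumer side described in this role: read a Kafka topic with Spark Structured Streaming, parse the JSON payloads, and persist the parsed stream. The original pipeline wrote to HBase, which would additionally require an HBase connector (for example via foreachBatch), so this shows only the general shape; the topic, schema, broker address, and paths are assumptions, and the spark-sql-kafka package is assumed to be on the classpath.

# Sketch: consume a Kafka topic with Structured Streaming and persist parsed events.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-stream-consumer").getOrCreate()

# Hypothetical schema for the JSON messages on the topic.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_ts", TimestampType()),
])

stream = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder brokers
         .option("subscribe", "clickstream-events")             # hypothetical topic
         .option("startingOffsets", "latest")
         .load()
)

# Kafka delivers the payload as bytes in the `value` column; parse it as JSON.
events = (
    stream.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*")
)

# Persist the parsed stream (an HBase sink would go through a connector instead).
query = (
    events.writeStream
          .format("parquet")
          .option("path", "hdfs:///data/streams/clickstream/")
          .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
          .outputMode("append")
          .start()
)

query.awaitTermination()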
Java Developer
Confidential
Responsibilities:
- Involved in client requirement gathering, analysis & application design.
- Involved in the implementation of the design through the vital phases of the Software Development Life Cycle (SDLC), including development, testing, implementation, and maintenance support, following the Waterfall methodology.
- Developed the UI layer with JSP, HTML, CSS, Ajax, and JavaScript.
- Used Asynchronous JavaScript and XML (AJAX) for better and faster interactive Front-End.
- Used JavaScript to perform client-side validations.
- Involved in Database Connectivity through JDBC.
- Ajax was used to make Asynchronous calls to server side and get JSON or XML data.
- Developed server-side presentation layer using Struts MVC Framework.
- Developed Action classes, Action Forms and Struts Configuration file to handle required UI actions and JSPs for Views.
- Developed batch jobs using EJB scheduling and leveraged container-managed transactions for highly transactional processing.
- Used various Core Java concepts such as Multi-Threading, Exception Handling, Collection APIs, Garbage collections for dynamic memory allocation to implement various features and enhancements.
- Developed Hibernate entities, mappings, and customized criterion queries for interacting with database.
- Implemented and developed REST and SOAP based web services to provide JSON and XML data.
- Involved in implementation of web services (top-down and bottom-up).
- Used JPA and JDBC in the persistence layer to persist the data to the DB2 database.
- Created and wrote SQL queries, tables, triggers, views, and PL/SQL procedures to persist and retrieve data from the database.
- Developed a Web service to communicate with the database using SOAP.
- Performed performance tuning and optimization with a Java performance analysis tool.
- Implemented JUnit test cases for Struts/Spring components and used JUnit for unit testing.
- Used Eclipse as IDE and worked on installing and configuring JBOSS.
- Used CVS for check-out and check-in operations.
- Deployed the components to WebSphere Application Server.
- Worked with production support team in debugging and fixing various production issues.
Environment: Java, JSP, HTML, CSS, AJAX, JavaScript, JSON, XML, Struts, Struts MVC, JDBC, JPA, Web Services, SOAP, SQL, JBOSS, DB2, ANT, Eclipse IDE, WebSphere.